[VL] Support mapping columns by position index for ORC and Parquet files #10697

Merged
rui-mo merged 1 commit into apache:main from kevinwilfong:map_by_index
Oct 15, 2025
@kevinwilfong
Collaborator

@kevinwilfong kevinwilfong commented Sep 12, 2025

What changes are proposed in this pull request?

In our data warehouse we support schema evolution by column index rather than by name. For example, if a Hive table has schema a, b, c but a partition has schema c, a, b, we don't reorder the partition's columns; instead we read partition column c as column a, partition column a as column b, and partition column b as column c.
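
The positional mapping described above can be illustrated with a small sketch. The function name `read_by_position` and the dict-based row representation are hypothetical and chosen only for illustration; this is not Gluten or Velox code.

```python
def read_by_position(table_columns, file_columns, file_rows):
    """Relabel file rows with table column names, matching purely by position.

    file_rows is a list of dicts keyed by the file's own column names
    (a hypothetical representation, for illustration only).
    """
    result = []
    for row in file_rows:
        # The i-th table column takes its value from the i-th file column,
        # regardless of what either column is called.
        result.append({
            table_col: row[file_col]
            for table_col, file_col in zip(table_columns, file_columns)
        })
    return result


table_columns = ["a", "b", "c"]   # table schema
file_columns = ["c", "a", "b"]    # partition (file) schema
file_rows = [{"c": 1, "a": 2, "b": 3}]

print(read_by_position(table_columns, file_columns, file_rows))
# [{'a': 1, 'b': 2, 'c': 3}]
```

The value stored under the file's column c surfaces as table column a, because both sit at position 0.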

This is supported in Velox by setting the configs hive.orc.use-column-names and hive.parquet.use-column-names in the HiveConfig to false for ORC and Parquet files respectively. Currently both are hard-coded to true in Gluten. This change adds two configs to Gluten's VeloxConfig, spark.gluten.sql.columnar.backend.velox.orcUseColumnNames and spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames, and plumbs them through to the HiveConfig in Velox.
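
As a rough sketch of the plumbing, the two Gluten keys can be thought of as translating one-to-one into the two Velox HiveConfig keys. The helper `to_hive_config` below is hypothetical, and the default of "true" for unset keys is an assumption based on the previously hard-coded behavior; only the four config names come from this description.

```python
# Gluten config key -> Velox HiveConfig key, as named in this description.
GLUTEN_TO_VELOX = {
    "spark.gluten.sql.columnar.backend.velox.orcUseColumnNames":
        "hive.orc.use-column-names",
    "spark.gluten.sql.columnar.backend.velox.parquetUseColumnNames":
        "hive.parquet.use-column-names",
}


def to_hive_config(spark_conf):
    """Translate Gluten settings into Velox HiveConfig entries.

    Unset keys fall back to "true" (match by name). That default is an
    assumption mirroring the old hard-coded value; this helper is a
    hypothetical illustration, not Gluten's implementation.
    """
    return {
        velox_key: spark_conf.get(gluten_key, "true")
        for gluten_key, velox_key in GLUTEN_TO_VELOX.items()
    }


spark_conf = {
    "spark.gluten.sql.columnar.backend.velox.orcUseColumnNames": "false",
}
print(to_hive_config(spark_conf))
# {'hive.orc.use-column-names': 'false', 'hive.parquet.use-column-names': 'true'}
```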

In addition, we need to pass the full table schema to the HiveTableHandle, as this is how Velox determines the indices of each column. I updated VeloxIteratorApi to set the FileSchema for the LocalFilesNodes it generates if necessary (if the config is enabled for the format of the file), and VeloxPlanConverter/SubstraitToVeloxPlan to propagate this to the HiveTableHandle when present.

Note that I considered setting it once in the ReadRel rather than in each LocalFilesNode. However, this introduced the problem that we could no longer read from tables containing column types we don't support, even when we don't read those columns, because we would still need to propagate them to the HiveTableHandle. Since partition file formats don't always match the table's file format, we don't know whether we need the schema until we generate the splits, at which point it's too late to update the plan. See #10569.
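
The per-split decision can be sketched as follows. This is a simplified model under stated assumptions: the names `needs_file_schema` and `plan_splits`, and the dict standing in for a LocalFilesNode, are hypothetical and do not reflect the actual VeloxIteratorApi code.

```python
def needs_file_schema(file_format, use_column_names):
    """Return True if a split of this format must carry the full table schema.

    use_column_names maps a format name ("orc", "parquet") to the value of
    the corresponding use-column-names flag. Formats not listed are assumed
    to be matched by name (an assumption made for this sketch).
    """
    return not use_column_names.get(file_format.lower(), True)


def plan_splits(partition_formats, use_column_names, table_schema):
    """Attach the table schema only to splits that are mapped by position."""
    splits = []
    for fmt in partition_formats:
        by_position = needs_file_schema(fmt, use_column_names)
        splits.append({
            "format": fmt,
            # Only positional mapping needs the full table schema.
            "file_schema": table_schema if by_position else None,
        })
    return splits


flags = {"orc": False, "parquet": True}
print(plan_splits(["orc", "parquet"], flags, ["a", "b", "c"]))
```

Because the format is only known per partition, the decision is made per split: here the ORC split carries the schema and the Parquet split does not.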

Please note that vanilla Spark partially supports matching by the position index https://issues.apache.org/jira/browse/SPARK-32864 and can be extended to do so by customizing readers.

How was this patch tested?

Added tests for ORC and Parquet files where the column names in the table don't match the column names in the file, and verified we could still read them by index when the flags are enabled.

@github-actions github-actions bot added the CORE (works for Gluten Core) and VELOX labels Sep 12, 2025
@github-actions

Run Gluten Clickhouse CI on x86

@kevinwilfong kevinwilfong requested a review from rui-mo October 6, 2025 16:52
Contributor

@Yohahaha Yohahaha left a comment

LGTM! thank you!

Contributor

@rui-mo rui-mo left a comment

Thanks for the update. Just one nit; the other changes LGTM.

Contributor

@rui-mo rui-mo left a comment

Here’s what comes to mind. There are three strategies for column mapping:

  1. Match by position
  2. Match by field name
  3. Match by unique permanent ID

I suppose Spark only supports (2) and (3) (see facebookincubator/velox#6065 (comment)), while Velox supports (1) and (2). Could you please clarify which is supported in this PR?
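
To make the difference between the three strategies concrete, here is a minimal sketch. The `resolve` function and the representation of a schema as a list of (name, field ID) pairs are hypothetical simplifications, not how Spark or Velox model schemas.

```python
def resolve(table_col, table_schema, file_schema, strategy):
    """Return the index of the file column backing table_col, or None.

    Each schema is a list of (name, field_id) pairs, a hypothetical
    simplification used only for this illustration.
    """
    names = [name for name, _ in table_schema]
    position = names.index(table_col)
    if strategy == "position":
        # (1) Same ordinal in the file, whatever the column is called.
        return position if position < len(file_schema) else None
    if strategy == "name":
        # (2) Look for a file column with the same name.
        key, part = table_col, 0
    elif strategy == "id":
        # (3) Look for a file column with the same permanent field ID.
        key, part = table_schema[position][1], 1
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for index, column in enumerate(file_schema):
        if column[part] == key:
            return index
    return None


# The file was written as (name, id); the table later renamed "name"
# to "full_name" and moved it after "id".
file_schema = [("name", 2), ("id", 1)]
table_schema = [("id", 1), ("full_name", 2)]

for strategy in ("position", "name", "id"):
    print(strategy, resolve("full_name", table_schema, file_schema, strategy))
# position 1
# name None
# id 0
```

With a rename plus a reorder, the three strategies pick three different answers, which is why it matters which one a reader uses.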

@kevinwilfong
Collaborator Author

> Here’s what comes to mind. There are three strategies for column mapping:
>
>   1. Match by position
>   2. Match by field name
>   3. Match by unique permanent ID
>
> I suppose Spark only supports (2) and (3) (see facebookincubator/velox#6065 (comment)), while Velox supports (1) and (2). Could you please clarify which is supported in this PR?

@rui-mo This exposes Velox's support for (1)

@rui-mo
Contributor

rui-mo commented Oct 14, 2025

@kevinwilfong I wonder whether Spark supports matching by position index; I assumed it only supports matching by field ID or column name. Please correct me if I'm wrong.

@kevinwilfong
Collaborator Author

kevinwilfong commented Oct 14, 2025

Vanilla Spark partially supports it https://issues.apache.org/jira/browse/SPARK-32864 and can be extended to do so by customizing readers (it's what we do)

@rui-mo rui-mo changed the title [VL] Support mapping columns by index for ORC and Parquet files [VL] Support mapping columns by position index for ORC and Parquet files Oct 14, 2025
@rui-mo
Contributor

rui-mo commented Oct 15, 2025

cc: @zhztheplayer If there are no further comments, we can proceed to merge this PR.

@zhztheplayer
Member

@rui-mo Please proceed. Thank you very much for the review.

@rui-mo rui-mo merged commit 6548ab4 into apache:main Oct 15, 2025
62 of 63 checks passed
@beliefer
Contributor

beliefer commented Nov 3, 2025

@kevinwilfong I encountered this issue too. I applied this PR to my Gluten build, but the problem still exists.
We created the table with the Hive client.

CREATE TABLE default.test_orc_table_hive_gluten
(
    id int,
    name string
)
PARTITIONED BY (dt string)
STORED AS ORC;

insert into test_orc_table_hive_gluten partition(dt='20240728') values (1, 'a'),(2,'b');

Then we queried the table with Spark SQL.

select * from test_orc_table_hive_gluten where dt = '20240728';

The output is shown below.

NULL	NULL	20240728
NULL	NULL	20240728

I set the configs shown below.

set spark.gluten.sql.columnar.backend.velox.orcUseColumnNames=false; 
set spark.gluten.sql.complexType.scan.fallback.enabled=false;

But nothing helped.

Did I miss something?

@kevinwilfong
Collaborator Author

@beliefer That's strange. Do you get the same results if you don't set spark.gluten.sql.columnar.backend.velox.orcUseColumnNames to false?

Nothing in your repro changes the schema of your data, so it should work regardless of the value of that config.

@beliefer
Contributor

beliefer commented Nov 4, 2025

@kevinwilfong Yes. Regardless of the value of spark.gluten.sql.columnar.backend.velox.orcUseColumnNames, the output is wrong.

@kevinwilfong
Collaborator Author

In that case, as mentioned in the linked issue, I don't think it's related to this change.


5 participants